Author: Alan O’Callaghan (alan.b.ocallaghan@gmail.com)
This vignette is intended to provide an overview of the use of heatmaply in the analysis of biological data. In particular, this vignette deals with its use to visualize gene expression patterns and sample relationships in RNAseq data relating to breast cancer samples, as retrieved from from Genomic Data Commons.
# Let's load the packages
library(heatmaply)
library(heatmaplyExamples)The data used in this vignette is available within the package. A full explanation of how this data was retrieved from the Genomic Data Commons is available in the data preprocessing vignette.
The R package limma is a commonly used statistical analysis tool using empirical Bayes methods. This package was originally developed for micro-array data. voom is a function within this package which transforms RNAseq count data in log2 counts per million reads, which allows it to be treated similarly. This function was used in the present example to provide normalized expression values. While pre-processing the data, quantile and TMM normalization were also applied. To compare these normalized values to raw read counts, the raw read counts were log2-transformed. 0.5 was added to prevent infinite values resulting from log2(0).
A common workflow for RNAseq differential expression analysis is to visualize gene expression measures and sample-sample correlations using a heatmap. This is useful to observe whether samples appear to cluster together, for the purposes of identifying poor quality samples or outliers. Furthermore, it may be useful to visualize differentially-expressed genes alongside row annotations which indicate patient or sample sub-group.
The following examples show sample-sample correlation based on all genes measured. Clustering in both instances appears to be related to membrane receptor subtypes, as shown in the row annotation.
When using the heatmaply function, notice the use of the showticklabels argument - by turning the labels off, the resulting heatmap is much lighter and allows fast zoom-in. The actual values can still be identified when hovering the mouse over the cells. Also, notice the use of the default color scheme chosen so to emphasize the correlation (which has low variability).
cor_mat_raw_logged <- cor(log2(raw_expression + 0.5))
heatmaply(cor_mat_raw_logged,
row_side_colors = tcga_brca_clinical,
main = 'Sample-sample correlation, log2 counts',
showticklabels = c(FALSE, FALSE),
# k_row = 5, k_col = 5, limits = c(0,1),
plot_method = 'plotly')cor_mat_voomed <- cor(voomed_expression)
heatmaply(cor_mat_voomed,
row_side_colors = tcga_brca_clinical,
main = 'Sample-sample correlation, log2 CPM',
showticklabels = c(FALSE, FALSE),
plot_method = 'plotly')Breast cancer has been studied extensively using gene expression profiling methods (see Breast cancer gene expression). This has lead to the identification of a number of gene sets which stratify patients into molecular subgroups. One such gene set is commonly known as the PAM50 gene set. In this example, we will visualize gene expression patterns using heatmaply.
Centering data before performing clustering tends to result in more meaningful cluster assignment. This is particularly true when the measure of interest is the similarity in patterns across features, rather than the total distance between values. Furthermore, it is typically the difference between samples which is of interest, rather than the difference between measures. Non-centered data may show that all samples measure high for one variable, and low for another, while centered data shows relative differences. Alternatively, one could use a distance measure which is invariant to total distance, such as correlation. Heatmaps of non-centered are shown in another vignette within this package. It can be seen that the concordance between the centered and non-centered heatmaps is mediocre, and the clustering of triple negative samples is not as definite.
In the heatmaps shown below, it is clear that samples appear to cluster loosely based on PAM50 subtype more than the previous examples. Concordance with the assigned labels shown in the row annotation is not complete, however this may be expected, given that a different clustering method was used here (hierarchical clustering, rather than k-medioids).
pam50_genes <- intersect(pam50_genes, rownames(raw_expression))
raw_pam50_expression <- raw_expression[pam50_genes, ]
voomed_pam50_expression <- voomed_expression[pam50_genes, ]
center_raw_mat <- cor_mat_raw_logged -
apply(cor_mat_raw_logged, 1, median)
raw_max <- max(abs(center_raw_mat), na.rm=TRUE)
raw_limits <- c(-raw_max, raw_max)
heatmaply(t(center_raw_mat),
row_side_colors = tcga_brca_clinical,
showticklabels = c(FALSE, FALSE),
fontsize_col = 7.5,
col = cool_warm(100),
main = 'Centred log2 read counts, PAM50 genes',
limits = raw_limits,
plot_method = 'plotly')Note that heatmaply_cor is just like heatmaply but with defaults that are better suited for correlation matrix (limits from -1 to 1, and a cold-warm color scheme).
heatmaply_cor(cor(center_raw_mat),
row_side_colors = tcga_brca_clinical,
showticklabels = c(FALSE, FALSE),
main = 'Sample-sample correlation based on centred, log2 PAM50 read counts',
plot_method = 'plotly')Following normalization, gene expression patterns appear roughly similar. This indicates that relative expression levels have not been altered unduly. Furthermore, slightly increased concordance with the pre-assigned cluster labels is observed in normalized data. Samples appear to cluster based Sample-sample correlation appears to show less concordance with row annotations than clustering based on gene expression. However, the use of different linkage criteria or distance measures may alter the observed clusters.
center_voom_mat <- voomed_pam50_expression -
apply(voomed_pam50_expression, 1, median)
voom_max <- max(abs(center_voom_mat))
voom_limits <- c(-voom_max, voom_max)
heatmaply(t(center_voom_mat),
row_side_colors=tcga_brca_clinical,
fontsize_col = 7.5,
showticklabels = c(TRUE, FALSE),
col = cool_warm(50),
limits = voom_limits,
main = 'Normalised, centred log2 CPM, PAM50 genes',
plot_method = 'plotly')heatmaply_cor(cor(center_voom_mat),
row_side_colors = tcga_brca_clinical,
showticklabels = c(FALSE, FALSE),
main = 'Sample-sample correlation based on centred, normalised PAM50 gene expression',
plot_method = 'plotly')It may be useful when examining expression heatmaps to identify particularly high or low measures for a single gene in a group of patients, or a gene which shows unusually high or low variance. The mouse-over text available in the heatmaply package allows visual assessment of measures of interest and quick identification of samples or genes with unusual gene expression patterns. Similarly, visualizing correlation heatmaps with heatmaply allows the user to rapidly identify samples with unusually high or low pairwise correlation.
sessionInfo()#> R version 3.4.1 (2017-06-30)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 7 x64 (build 7601) Service Pack 1
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=Hebrew_Israel.1255 LC_CTYPE=Hebrew_Israel.1255
#> [3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C
#> [5] LC_TIME=Hebrew_Israel.1255
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices datasets utils methods
#> [8] base
#>
#> other attached packages:
#> [1] gplots_3.0.1 limma_3.22.7
#> [3] ALL_1.7.1 Biobase_2.26.0
#> [5] BiocGenerics_0.12.1 dendextend_1.6.0
#> [7] glmnet_2.0-10 foreach_1.4.3
#> [9] Matrix_1.2-10 bindrcpp_0.2
#> [11] knitr_1.16 heatmaplyExamples_0.1.0
#> [13] heatmaply_0.11.0 viridis_0.4.0
#> [15] viridisLite_0.2.0 plotly_4.7.1
#> [17] ggplot2_2.2.1.9000 installr_0.19.0
#> [19] stringr_1.2.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.2.1 tidyr_0.6.3 jsonlite_1.5
#> [4] gtools_3.5.0 shiny_1.0.3 assertthat_0.2.0
#> [7] stats4_3.4.1 yaml_2.1.14 robustbase_0.92-7
#> [10] backports_1.1.0 lattice_0.20-35 glue_1.1.1
#> [13] digest_0.6.12 RColorBrewer_1.1-2 colorspace_1.3-2
#> [16] httpuv_1.3.5 htmltools_0.3.6 plyr_1.8.4
#> [19] devtools_1.13.2 pkgconfig_2.0.1 xtable_1.8-2
#> [22] purrr_0.2.2.2 mvtnorm_1.0-6 scales_0.5.0
#> [25] gdata_2.18.0 whisker_0.3-2 tibble_1.3.4
#> [28] mgcv_1.8-17 withr_2.0.0 nnet_7.3-12
#> [31] lazyeval_0.2.0 mime_0.5 magrittr_1.5
#> [34] mclust_5.3 memoise_1.1.0 evaluate_0.10.1
#> [37] nlme_3.1-131 MASS_7.3-47 class_7.3-14
#> [40] tools_3.4.1 registry_0.3 data.table_1.10.4
#> [43] trimcluster_0.1-2 kernlab_0.9-25 munsell_0.4.3
#> [46] cluster_2.0.6 fpc_2.1-10 compiler_3.4.1
#> [49] caTools_1.17.1 rlang_0.1.2 grid_3.4.1
#> [52] iterators_1.0.8 htmlwidgets_0.9 crosstalk_1.0.0
#> [55] labeling_0.3 bitops_1.0-6 rmarkdown_1.6
#> [58] gtable_0.2.0 codetools_0.2-15 flexmix_2.3-14
#> [61] reshape2_1.4.2 TSP_1.1-5 R6_2.2.2
#> [64] seriation_1.2-2 gridExtra_2.2.1 prabclus_2.2-6
#> [67] dplyr_0.7.3.9000 bindr_0.1 rprojroot_1.2
#> [70] KernSmooth_2.23-15 modeltools_0.2-21 stringi_1.1.5
#> [73] Rcpp_0.12.12 gclus_1.3.1 DEoptimR_1.0-8
#> [76] diptest_0.75-7